Week 10 - Discussion Questions

These are example discussion points for you to think about before class. You are not expected to engage with all of them — pick the ones that speak most directly to your own research, and bring two or three rough answers to the in-class session. The full description of how to use these pages, including what the question tags mean, is on the Week 1 Discussion page.

Sub-lessons

What Agents Are, and What's New in 2026

Calibrate: Pick a current product marketed as an “agent.” Using the lesson's working definition, decide whether it actually meets the bar, marketing language aside. What specific capability would have to be added or subtracted to flip your answer?
Apply: For your own research, where does the move from “chatbot” to “agent” genuinely change what you can attempt, and where does it just add complexity without payoff?
Critical: “The harness is the product” (10.1) is a strong claim. Make the case against it: where does the underlying model matter more than the harness, in ways the lesson under-weights?
Connect: Week 9 covered illusions of understanding, sycophancy, and benchmark contamination as the limits of single-turn chat models. Now read those limits forward into the agentic frame here. Are agents structurally more vulnerable to those failure modes (because failures compound across steps), structurally less vulnerable (because tool use grounds them in external state), or both, depending on the task?

Failure Modes for Long-Horizon Tasks

Calibrate: The Princeton reliability finding (accuracy ↑, reliability flat) is the cornerstone of this lesson. Take a specific long-horizon AI use you trust: do you have evidence its reliability has improved over the last 18 months, or have you been mistaking accuracy gains for reliability gains?
Apply: Design a single “reliability test” for an agent you use, one you would run before delegating something serious to it. What does the test actually measure, and what does it leave unmeasured?
Critical: The “Why Reasoning Fails to Plan” structural argument is bracing. Steel-man the case against it: are there long-horizon tasks where greedy step-wise reasoning is genuinely fine, even at scale?
Connect: The structural-failure framing in 10.2 picks up directly from the “structural” category in 9.2. Are the structural failures in chat models and in agents the same family, or different families that look similar at first?

RAG in 2026

Calibrate: Pick a RAG system you have used. Identify which of the lesson's failure modes (retrieval quality, chunking, context-window pressure, etc.) is most responsible for its current ceiling. What evidence supports that diagnosis?
Apply: For your own research domain, sketch the document set that would make the best RAG corpus. Where would the corpus need the most curation, and where can you trust default retrieval?
Critical: RAG has been the dominant pattern for “making models smarter” for two years. When does it stop being the right tool and give way to longer context windows, fine-tuning, or agentic retrieval?
Connect: RAG is one specific instance of the broader “harness around the model” pattern from 10.1. Pick another instance from this week and compare what each adds and what each costs.

The Current Tool Landscape and MCP

Calibrate: MCP is positioned as the “USB-C of AI tools.” Has that analogy aged well so far? Pick one concrete integration in your workflow and ask whether MCP made it noticeably easier, noticeably harder, or no different.
Apply: Sketch the minimum set of MCP servers you would want in your own research setup over the next year. What does the list say about the kind of researcher you are trying to become?
Critical: Tool standards lock in design choices. Which MCP design choices, in retrospect, will look most consequential five years from now, and could plausibly be wrong?
Connect: The MCP ecosystem is the operational version of the “harness ecosystem” framing in 10.1. Does growing the MCP catalogue make the harness more or less important relative to the model?

Advanced Research Tools — A Curated Tour

Calibrate: Pick the tool from the curated tour that you have not yet used but most expect to be useful. Predict where it will be strong and weak before you try it. Then try it. How accurate was the prediction?
Apply: For your current research bottleneck, which two tools from the tour are most directly relevant, and which one would you adopt first? What would stop you?
Critical: Curated tool tours are a snapshot in time. Which tools on the list do you predict will be obsolete by the time you finish your degree, and what skills will outlast them?
Connect: Weeks 5, 6, and 7 each gave you a domain-specific tool tour (literature, writing, data). This sub-lesson is a cross-domain tour. Looking back across the four, which earlier domain-specific tool actually impressed you in practice, and which one's claims have not survived contact with your own work?

Hands-On Activities and Assessment

Calibrate: Run one of the agentic activities on a real (low-stakes) task. Compare what the agent reported it did with the artifacts (intermediate files, logs, code). Where they diverge, what does that say about how much of the agent's self-report you can trust?
Apply: Build a one-page personal “agentic toolkit” for your research: tools, prompts, CLAUDE.md, the works. What is the smallest version of this that you would actually maintain for a year?
Critical: The activities reward you for catching failures. In practice, the failures that matter are the ones you don't catch. How would you design an activity that tests for those?
Connect: The activities in Weeks 5, 6, 7, 8, and 9 each asked you to verify the output of a relatively short AI interaction. These activities ask you to verify the output of an hour-long autonomous agent run. Which of the earlier verification habits transferred cleanly to the agentic case, and which one had to be substantially rebuilt?